Red wine quality dataset exploration

This report explores a dataset of red wines described by 11 input variables and a quality rating.

Univariate Plots Section

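The dataset consists of 15 variables and 1599 observations: a row index (X), 11 input variables, the quality rating, and two derived factors (quality.factor and quality.cat).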

## 'data.frame':    1599 obs. of  15 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.factor      : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ quality.cat         : Factor w/ 3 levels "bad","Average",..: 2 2 2 2 2 2 2 3 3 2 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality      quality.factor  quality.cat  
##  Min.   : 8.40   Min.   :3.000   3: 10          bad    :  63  
##  1st Qu.: 9.50   1st Qu.:5.000   4: 53          Average:1319  
##  Median :10.20   Median :6.000   5:681          good   : 217  
##  Mean   :10.42   Mean   :5.636   6:638                        
##  3rd Qu.:11.10   3rd Qu.:6.000   7:199                        
##  Max.   :14.90   Max.   :8.000   8: 18

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Most wines were rated average (5 or 6), with a mean of 5.6 and a median of 6. No wine was rated 1 or 2, and none was given 9 or 10. A very small share is rated 8 (18 wines) and an even smaller share is rated 3 (10 wines). It would be interesting to compare these two extremes (3 vs 8) to see whether specific attributes differ starkly.

For now, let’s look at other variables to see what we can find.

## # A tibble: 6 × 6
##   quality  mean_alc median_alc min_alc max_alc     n
##     <int>     <dbl>      <dbl>   <dbl>   <dbl> <int>
## 1       3  9.955000      9.925     8.4    11.0    10
## 2       4 10.265094     10.000     9.0    13.1    53
## 3       5  9.899706      9.700     8.5    14.9   681
## 4       6 10.629519     10.500     8.4    14.0   638
## 5       7 11.465913     11.500     9.2    14.0   199
## 6       8 12.094444     12.150     9.8    14.0    18

Mean alcohol content is much higher in highly rated wines than in low-rated ones: it is about 9.9 to 10.3 for ratings 3 to 5 and rises to about 12.1 for rating 8.
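A per-rating table like the one above can be rebuilt with a short grouped summary. The sketch below is a minimal base R version under stated assumptions: `rw_head` is a hypothetical stand-in that holds only the first ten alcohol and quality values from the str() listing, and the report itself may have used a different approach (the tibble output suggests dplyr).

```r
# Stand-in data: first ten alcohol and quality values from the str() output.
# This is NOT the full 1599-row dataset; rw_head is a hypothetical name.
rw_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Split the alcohol values by quality rating, then summarise each group.
groups <- split(rw_head$alcohol, rw_head$quality)
alc_by_quality <- data.frame(
  quality    = as.integer(names(groups)),
  mean_alc   = sapply(groups, mean),
  median_alc = sapply(groups, median),
  min_alc    = sapply(groups, min),
  max_alc    = sapply(groups, max),
  n          = sapply(groups, length),
  row.names  = NULL
)
print(alc_by_quality)
```

With the full data frame in place of `rw_head`, the same code would give one row per rating from 3 to 8.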

The most common alcohol content is between 9.5 and 10, which is interesting because, based on common knowledge, most wines have an alcohol content of around 13.5% alc/vol. Most wines here are therefore relatively low in alcohol, which might help explain why few have been rated very high.

Fixed acidity looks roughly normally distributed.

Let’s look at volatile acidity next, since a high concentration gives wine a vinegary taste and should, in theory, affect quality.

## # A tibble: 6 × 6
##   quality mean_acid median_acid min_acid max_acid     n
##     <int>     <dbl>       <dbl>    <dbl>    <dbl> <int>
## 1       3 0.8845000       0.845     0.44    1.580    10
## 2       4 0.6939623       0.670     0.23    1.130    53
## 3       5 0.5770411       0.580     0.18    1.330   681
## 4       6 0.4974843       0.490     0.16    1.040   638
## 5       7 0.4039196       0.370     0.12    0.915   199
## 6       8 0.4233333       0.370     0.26    0.850    18

We can see that mean and median volatile acidity decrease as quality rises from 3 to 7, then level off at 8 (the median stays at 0.37 and the mean is slightly higher than for 7).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Some wines are highly acidic (above 1 g/l, with a few above 1.5 g/l), but for the most part they lie around 0.5 g/l.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Many wines have a fairly low sugar content between 1.5 and 2.5 g/l, with 2 g/l being the most common. None would be considered “very sweet” (> 45 g/l), since the sugar content only ranges from 0.9 to 15.5 g/l. The distribution is right-skewed (the mean of 2.54 sits above the median of 2.2), so let’s apply a log transform to bring it closer to normal.
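As a rough illustration of what the log transform does to a right-skewed variable, here is a minimal base R sketch. It assumes only the first ten residual sugar values from the str() listing, so it is not the full data, and the variable names are hypothetical. A log transform does not guarantee normality; it only compresses the long right tail.

```r
# First ten residual sugar values from the str() output (g/l); not the full data.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# log10 compresses the right tail: 6.1 is roughly 3 times the median (1.9)
# on the raw scale, but sits much closer to the other values after the transform.
log_sugar <- log10(residual_sugar)

# Bin the transformed values without drawing, to inspect the bin counts.
h <- hist(log_sugar, plot = FALSE)
print(h$counts)

# The transform is monotonic, so the ordering of the wines is unchanged.
stopifnot(identical(order(residual_sugar), order(log_sugar)))
```

In ggplot2 a comparable effect on the axis can be had by adding `scale_x_log10()` to a histogram, which keeps the tick labels in the original units.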

We also notice a small batch of wines with a sugar content above 8 g/l, but they are very few in number. Let’s zoom in to take a closer look.

Two wines rated good have low sugar content and the one rated bad has high sugar. However, wines rated average have sugar content across the whole spectrum, which makes residual sugar a definite candidate for bivariate analysis. Next, let’s look at a few other variables.

Most wines don’t differ much in density. Since density is affected by alcohol and sugar content, we’ll set it aside for now.

We see almost a bimodal distribution.

Most wines contain only small, varying amounts of citric acid, and many have little to none. More than 110 wines have no citric acid at all and have been rated average.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Most wines appear to have low chloride levels, roughly between 0.01 and 0.2. Let’s zoom in on these, ignoring the tail for now.

It is quite apparent that most wines have chloride values between 0.05 and 0.1. Now let’s look at the tail.

We see a handful of wines with a high chloride (i.e. salt) content. Most of these are rated average and just a couple are rated good. So although high chloride content doesn’t seem to contribute to good wines, it doesn’t always lead to bad wines either.

Let’s look at free sulfur dioxide, since SO2 affects oxidation, which in turn is known to affect the quality of wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

We do have wines with more than 50 mg/l of free SO2. The dataset description explains that at concentrations above that level, SO2 affects the taste and smell of the wine. Common knowledge also indicates that SO2 has a foul odor, so let’s investigate this.

We have about 6 wines with a high free SO2 content. Surprisingly, none of them were rated bad!
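The check described above amounts to filtering on a threshold and counting the remaining wines by rating category. A minimal sketch, assuming invented toy values: the data frame `toy` and its numbers are hypothetical and are not taken from the dataset. The rating categories follow the bad / Average / good grouping used in this report.

```r
# Invented toy rows for illustration only; these are NOT values from the dataset.
toy <- data.frame(
  free.sulfur.dioxide = c(11, 25, 55, 68, 17, 52),
  quality             = c(5L, 5L, 6L, 7L, 4L, 5L)
)

# Same grouping as quality.cat: bad (<= 4), Average (5-6), good (>= 7).
toy$quality.cat <- cut(toy$quality,
                       breaks = c(-Inf, 4, 6, Inf),
                       labels = c("bad", "Average", "good"))

# Keep only wines above the 50 mg/l threshold and count them per category.
high_so2 <- subset(toy, free.sulfur.dioxide > 50)
high_counts <- table(high_so2$quality.cat)
print(high_counts)
```

Because `quality.cat` is a factor, categories with no wines above the threshold still appear in the table with a count of zero, which makes an absence of "bad" wines easy to spot.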

Let’s then take a look at total SO2 and apply a log transformation to get a more normal-looking curve.

A small number of wines sit in the high tail, with a total SO2 content of up to around 289 mg/l. The most common level is about 30 mg/l, which close to 300 wines seem to have.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Looking at pH and sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The bulk of the wines have a pH between 3.2 and 3.5, which agrees with the known fact that most wines are fairly acidic, in the range of 3 to 4 on the pH scale. We do see one wine at about 3.9 and another as low as 2.7. Both were rated “bad”.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates contribute to SO2 gas, which prevents oxidation and also acts as an antimicrobial, according to the information provided with the dataset. The concentration ranges from 0.3 to 1.4 for most wines, with a few wines reaching concentrations of around 2.

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wines, each rated on a scale from 1 to 10, with 10 being the best and 1 the worst. 11 input variables (e.g. pH, alcohol content) were measured for each wine, along with the output variable (quality). All the input variables are numeric, while the output variable is an integer.

Other observations:
1. No wines have been given either a very poor rating (1) or an excellent rating (9 or 10).
2. The average rating is 5.6 with a median of 6. The lowest rating is 3 and the highest is 8.

What is/are the main feature(s) of interest in your dataset?

Based on common knowledge, factors such as sugar, alcohol and acidity affect taste and quality. It would also be interesting to draw a scatter plot matrix to understand the effect of the other variables, as well as the relationships among the input variables.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think sulphates, pH and sulfur dioxide also affect quality.

Did you create any new variables from existing variables in the dataset?

Yes. Since quality is numeric, I converted it to a factor variable named quality.factor for ease of analysis. I also created quality.cat, which groups the ratings into bad (4 or below), Average (5 or 6) and good (7 or above).

# Raw rating as a factor (levels "3" to "8").
rw$quality.factor <- factor(rw$quality)

# Three-level category: bad (<= 4), Average (5 or 6), good (>= 7).
rw$quality.cat <- cut(rw$quality,
                      breaks = c(-Inf, 4, 6, Inf),
                      labels = c("bad", "Average", "good"))

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Residual sugar and total sulfur dioxide were not normally distributed, so I transformed them using the log function to bring them closer to a normal distribution.

Bivariate Plots Section

Drawing a sample of 1000, let’s look at ggpairs plots.

With respect to quality, there’s a moderate positive correlation with alcohol content (0.504) and a weak positive correlation with citric acid (0.223) and sulphates concentration (0.267).

There is a weak negative correlation between quality and volatile acidity (-0.377) and total sulfur dioxide (-0.196).

Surprisingly, neither pH (-0.068) nor chlorides (-0.129) is strongly correlated with quality.
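Coefficients like the ones quoted above can be cross-checked without a plot matrix by calling `cor()` on a sampled subset. This is a minimal base R sketch, assuming a hypothetical stand-in `rw_head` built from the first ten rows shown in the str() listing, so the printed values will not match the figures in the text.

```r
# Stand-in data: first ten rows of three columns from the str() output.
# rw_head is a hypothetical name and this is NOT the full dataset.
rw_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Draw up to 1000 rows without replacement; fix the seed so the sample repeats.
set.seed(1)
n_sample  <- min(1000, nrow(rw_head))
rw_sample <- rw_head[sample(nrow(rw_head), n_sample), ]

# Pearson correlation matrix of all numeric columns.
cor_mat <- cor(rw_sample, method = "pearson")
print(round(cor_mat, 3))
```

Fixing the seed matters here because the reported correlations come from a sample: a different draw of 1000 rows would give slightly different numbers.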

Relationship between variables

Interestingly, there is a strong positive correlation between fixed acidity and density (0.658) as well as between citric acid and fixed acidity (0.665). The latter may just be a result of how acidity is measured, since citric acid contributes to the overall acid content. The former would be interesting to explore. Fixed acidity is also highly negatively correlated with pH.

Let us examine how each variable is distributed across the quality ratings using box plots.

We can see box plots of density and pH are almost mirror images of each other. Let’s further explore scatter plots of just these variables of interest.

We see most wines rated bad are low in alcohol and most wines rated good are high in alcohol content. But, we also see wines with high alcohol content rated average as shown by the outliers. Therefore, clearly some other factors also contribute to quality.

With regard to total sulfur dioxide content, most wines rated average have low levels. It is therefore likely that this variable was not a major factor affecting quality. Unless we have more data (i.e. more wines rated bad and good in the dataset) and can analyze the SO2 content there, we can set this aside.

Let’s further explore the relationship between variables. A negative correlation between pH and acidity is expected, since a lower pH means a more acidic wine.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

We see most wines rated bad are low in alcohol and most wines rated good are high in alcohol content. However, we also see wines with high alcohol content rated average, as shown by the outliers. The same was true of citric acid: a fair amount of it is good, but a lot of it doesn’t necessarily mean a good wine, and its absence doesn’t imply bad quality either. Clearly, many factors play together in determining quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Based on common knowledge and the information provided with the dataset, I assumed a high concentration of free sulfur dioxide (> 50 mg/l) would lead to bad wines, but that was not the case when I plotted free SO2 against quality.

What was the strongest relationship you found?

The strongest relationship (for quality) was definitely with alcohol content. For wines with high alcohol content, other factors also played into determining quality, but almost no wines rated good were low in alcohol (the lowest alcohol content among wines rated 7 or 8 is 9.2), which strengthened the result.

Multivariate Plots Section

Plot: citric acid vs alcohol by quality

Looking at the first and third plots, wines rated good generally have a higher citric acid content, implying fresher wines. Some wines with a low citric acid content were still rated higher, and these had a high alcohol content.
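A scatter plot of this kind, with points coloured by rating category, could be drawn roughly as follows. This is a sketch under stated assumptions: it requires the ggplot2 package, uses a hypothetical stand-in `rw_head` built from the first ten rows in the str() listing, and the axis assignment (alcohol on x, citric acid on y) is inferred from the descriptions in this report rather than taken from the original code.

```r
library(ggplot2)

# Stand-in data: first ten rows from the str() output; NOT the full dataset.
rw_head <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality     = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Same grouping as quality.cat: bad (<= 4), Average (5-6), good (>= 7).
rw_head$quality.cat <- cut(rw_head$quality,
                           breaks = c(-Inf, 4, 6, Inf),
                           labels = c("bad", "Average", "good"))

# Build the plot; labs() is added to the plot object rather than called alone.
p <- ggplot(rw_head, aes(x = alcohol, y = citric.acid, colour = quality.cat)) +
  geom_point() +
  labs(title = "citric acid vs alcohol by quality",
       x = "alcohol", y = "citric acid", colour = "quality")
print(p)
```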

Regression model

## 
## Calls:
## m1: lm(formula = I(rw$quality) ~ I(rw$alcohol), data = rw)
## m2: lm(formula = I(rw$quality) ~ I(rw$alcohol) + rw$citric.acid, 
##     data = rw)
## m3: lm(formula = I(rw$quality) ~ I(rw$alcohol) + rw$citric.acid + 
##     rw$sulphates, data = rw)
## m4: lm(formula = I(rw$quality) ~ I(rw$alcohol) + rw$citric.acid + 
##     rw$sulphates + rw$volatile.acidity, data = rw)
## 
## ===================================================================
##                           m1         m2         m3         m4      
## -------------------------------------------------------------------
##   (Intercept)           1.875***   1.830***   1.434***   2.646***  
##                        (0.175)    (0.171)    (0.176)    (0.201)    
##   I(rw$alcohol)         0.361***   0.346***   0.338***   0.309***  
##                        (0.017)    (0.016)    (0.016)    (0.016)    
##   rw$citric.acid                   0.730***   0.513***  -0.079     
##                                   (0.090)    (0.093)    (0.104)    
##   rw$sulphates                                0.814***   0.696***  
##                                              (0.107)    (0.103)    
##   rw$volatile.acidity                                   -1.265***  
##                                                         (0.113)    
## -------------------------------------------------------------------
##   R-squared                0.227      0.257      0.284      0.336  
##   adj. R-squared           0.226      0.256      0.282      0.334  
##   sigma                    0.710      0.696      0.684      0.659  
##   F                      468.267    276.595    210.501    201.777  
##   p                        0.000      0.000      0.000      0.000  
##   Log-likelihood       -1721.057  -1688.711  -1659.955  -1599.093  
##   Deviance               805.870    773.917    746.576    691.852  
##   AIC                   3448.114   3385.421   3329.910   3210.186  
##   BIC                   3464.245   3406.930   3356.795   3242.448  
##   N                     1599       1599       1599       1599      
## ===================================================================
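The Calls section above shows the variables written as `rw$alcohol` inside the formula. A more conventional way to specify the same kind of nested models is to use bare column names together with `data =`, which also lets `predict()` work on new data. The sketch below assumes a hypothetical stand-in `rw_head` with only the first ten rows from the str() listing, so its coefficients will not match the table.

```r
# Stand-in data: first ten rows from the str() output; NOT the full dataset.
rw_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Smallest and largest of the nested models, using bare column names.
m1 <- lm(quality ~ alcohol, data = rw_head)
m4 <- lm(quality ~ alcohol + citric.acid + sulphates + volatile.acidity,
         data = rw_head)

# R-squared never decreases when predictors are added to a nested model.
r2 <- c(m1 = summary(m1)$r.squared, m4 = summary(m4)$r.squared)
print(round(r2, 3))
```

Because R-squared cannot fall as predictors are added, the adjusted R-squared, AIC and BIC rows of the table are the better guides when comparing m1 through m4.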

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Alcohol and citric acid strengthened each other: both contributed to good quality wines. However, although a good concentration of both is important for good quality, too much of either doesn’t necessarily imply that you would get a good wine.

Were there any interesting or surprising interactions between features?

The interaction between sulphates and alcohol content was interesting. Wines rated bad were mostly clustered around the bottom left, with low concentrations of sulphates. Wines on the right were mostly rated average or good, with most of the good wines clustered at the top right, with high concentrations of both.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Yes, I created a regression model to predict the quality of wines based on selected independent variables in the dataset.

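With all 4 variables (alcohol, citric acid, sulphates and volatile acidity) included, the model reaches its highest R-squared and explains about 34% of the variance in quality (R-squared = 0.336). In that model, citric acid is no longer statistically significant once volatile acidity is added.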

The model has some limitations. The variables were chosen from ggpairs plots based on a sample of the dataset, so additional data might alter the results. Also, the dataset covers nearly 1600 wines with very few at either end of the quality spectrum, so adding more wines of very low quality (3) or very high quality (8 or above) is likely to affect the results.
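To make the fitted coefficients concrete, the alcohol-only model m1 from the regression table can be evaluated by hand. The sketch below only plugs the reported coefficients (intercept 1.875, slope 0.361) into the linear equation; the function name `predict_quality_m1` is hypothetical. At the mean alcohol content of 10.42 it returns about 5.64, close to the mean quality of 5.636, as expected for a least-squares line, which passes through the means of both variables.

```r
# Coefficients of model m1 as reported in the regression table.
intercept     <- 1.875
slope_alcohol <- 0.361

# Hypothetical helper: predicted quality for a given alcohol content under m1.
predict_quality_m1 <- function(alcohol) {
  intercept + slope_alcohol * alcohol
}

# Prediction at the mean alcohol content from the summary output (10.42).
pred_at_mean <- predict_quality_m1(10.42)
print(pred_at_mean)

# Going from 10 to 12 alcohol raises the predicted quality by 2 * slope.
pred_gain <- predict_quality_m1(12) - predict_quality_m1(10)
print(pred_gain)
```

A gain of roughly 0.7 quality points across two percentage points of alcohol also shows why alcohol alone leaves much of the variation unexplained (R-squared of 0.227 for m1).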

Final Plots and Summary

Plot One

Description One

We can see that wines rated bad (red) sit around the bottom left and wines rated good (green) cluster around the top right, with average wines clustered around the middle. This explains, to an extent, the interaction between some of the variables that affect quality.

Plot Two

Description Two

The box plots clearly depict how each factor varies with quality. The distribution shifts higher with increasing quality for alcohol, citric acid and sulphates, and we see the reverse for variables such as volatile acidity and density.

Plot Three

Description Three

Pitting citric acid against alcohol, we can see good wines (blue) clustered mostly at the top of the plot, average wines (green) in the middle and bad wines (red) clustered around the bottom. This suggests that citric acid results in fresh-tasting and therefore better quality wines. However, too much of it doesn’t necessarily imply a good wine, as can be seen from a green point at the top right with both high citric acid and high alcohol content.

We see a lot of good wines centered in the middle (alcohol content between 10 and 13). On the left, where alcohol content is low, we see mostly wines rated bad and average. On the right, where alcohol content is high, we mostly see good and average wines but no bad wines. We can conclude that alcohol content is a useful predictor of wine quality.

Reflection

It was interesting to see some assumptions confirmed, such as alcohol content and citric acid being associated with good quality wines. It was surprising to learn that some variables, such as sugar, do not really affect quality as much. Perhaps there is a big difference between what we know as fact and what we perceive.

Since there was little data on both good and bad wines but more on average wines, it was hard to delineate a pattern from such limited information.

Also, the quality of these wines has been analyzed based on what can be measured quantitatively, but other factors such as color and smell, which are also major factors in deciding quality, would be interesting to analyze.
